In this project the red wine data will be analysed. The main aim is to understand which variables in the dataset impact the quality of the wine. This will be done by performing Exploratory Data Analysis (EDA) on the dataset. We will perform univariate, bivariate and multivariate analysis to understand the data and its variables.
## [1] "C:/udacity"
The data has been loaded into redWineData. We will run the str function on the dataset to view the variables present.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## [1] 1599 13
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
## 'data.frame': 1599 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ rating : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
## $ total_acidity : num 8.1 8.68 8.56 11.48 8.1 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality rating total_acidity
## Min. : 8.40 Min. :3.000 bad : 63 Min. : 5.120
## 1st Qu.: 9.50 1st Qu.:5.000 average:1319 1st Qu.: 7.680
## Median :10.20 Median :6.000 good : 217 Median : 8.445
## Mean :10.42 Mean :5.636 Mean : 8.847
## 3rd Qu.:11.10 3rd Qu.:6.000 3rd Qu.: 9.740
## Max. :14.90 Max. :8.000 Max. :16.285
The individual variables will be analysed before finding their impact on the quality of wine. This will help us understand the nature of each variable.
Import all the required libraries
We will plot the distribution of quality in the dataset. Here is the plot for quality:
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 5.64 0.81 6 5.59 1.48 3 8 5 0.22 0.29
## se
## X1 0.02
## nbr.val nbr.null nbr.na min max
## 1.599000e+03 0.000000e+00 0.000000e+00 3.000000e+00 8.000000e+00
## range sum median mean SE.mean
## 5.000000e+00 9.012000e+03 6.000000e+00 5.636023e+00 2.019555e-02
## CI.mean.0.95 var std.dev coef.var
## 3.961255e-02 6.521684e-01 8.075694e-01 1.432871e-01
From the above statistics and plot, most observations have a quality score between 5 and 7. The scores range from 3 to 8, so there are no observations with a score of 10.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 8.32 1.74 7.9 8.15 1.48 4.6 15.9 11.3 0.98 1.12
## se
## X1 0.04
The plot shows that fixed acidity is approximately normally distributed, with some right skew (skewness 0.98). The mean fixed acidity in the dataset is 8.32.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 0.53 0.18 0.52 0.52 0.18 0.12 1.58 1.46 0.67 1.21
## se
## X1 0
The plot for volatile acidity shows characteristics similar to fixed acidity, with an approximately normal distribution. The mean for this variable is 0.53.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 3.31 0.15 3.31 3.31 0.15 2.74 4.01 1.27 0.19 0.8
## se
## X1 0
The pH distribution shows slight right skewness (0.19), with a mean of 3.31.
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1599 0.27 0.19 0.26 0.26 0.25 0 1 1 0.32 -0.79 0
The graph also shows a right-sided tail. The mean of the citric acid distribution is 0.27. Zero is the most common citric acid value, as the tallest spike is at 0.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 2.54 1.41 2.2 2.26 0.44 0.9 15.5 14.6 4.53 28.49
## se
## X1 0.04
The residual sugar distribution is highly right skewed. The mean for the distribution is 2.54.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 0.09 0.05 0.08 0.08 0.01 0.01 0.61 0.6 5.67 41.53
## se
## X1 0
The chloride distribution shows skewness similar to that of residual sugar. Most observations have a value below 0.1.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 15.87 10.46 14 14.58 10.38 1 72 71 1.25 2.01
## se
## X1 0.26
The free sulfur dioxide distribution is right skewed with a mean of 15.87.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 46.47 32.9 38 41.84 26.69 6 289 283 1.51 3.79
## se
## X1 0.82
The total sulfur dioxide distribution is right skewed with a mean of 46.47.
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1599 0.66 0.17 0.62 0.64 0.12 0.33 2 1.67 2.42 11.66 0
The sulphate distribution is right skewed with outliers and mean of 0.66.
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1599 1 0 1 1 0 0.99 1 0.01 0.07 0.92 0
The density distribution is approximately normal with a mean of about 1 (0.9967). The median and mean values are almost the same for this distribution (0.9968 and 0.9967).
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 10.42 1.07 10.2 10.31 1.04 8.4 14.9 6.5 0.86 0.19
## se
## X1 0.03
The alcohol distribution shows that, for the majority of observations, the alcohol percentage is between 9 and 11. The mean is 10.42.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 8.85 1.7 8.45 8.69 1.42 5.12 16.29 11.17 0.97 1.23
## se
## X1 0.04
Total acidity is approximately normally distributed, with a mean of 8.85.
From the above plot it can be observed that the majority of the observations fall under the average rating category. Very few observations fall under the good and bad categories. This will make it difficult to find the variables that have an impact on the quality of the wine.
Alcohol, density, pH, fixed acidity, volatile acidity and citric acid are approximately normally distributed, as their skewness is below 1. Sulphates, total sulfur dioxide and free sulfur dioxide are moderately positively skewed. Chlorides and residual sugar are highly positively skewed, with extreme outliers. The citric acid variable has a large number of zero values. Most of the quality data falls in the average category (scores 5 and 6); there are very few observations for good (7 and above) and bad (4 and below) wines. For most observations, the alcohol value is between 9 and 11.
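The skewness values quoted above can be reproduced in base R. The following is a minimal sketch of one common moment-based definition; the helper `sample_skewness` is a hypothetical name, and the exact formula used by the package that produced the tables above may differ slightly.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# sample_skewness is an illustrative helper, not part of the original report.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  m3 / m2^1.5
}

# Illustrative vectors (not wine data): one symmetric, one with a right tail.
symmetric <- c(1, 2, 3, 4, 5)
right_tailed <- c(1, 1, 1, 2, 10)

sample_skewness(symmetric)     # zero for a symmetric sample
sample_skewness(right_tailed)  # positive for a right-skewed sample
```

On the real data, a call such as `sample_skewness(redWineData$residual.sugar)` should give a value close to the one reported for residual sugar.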
The dataset has 13 variables and 1599 observations. Besides the row index X, the variables in the dataset are:
fixed.acidity
volatile.acidity
citric.acid
residual.sugar
chlorides
free.sulfur.dioxide
total.sulfur.dioxide
density
pH
sulphates
alcohol
quality
The variable of interest is quality. We want to study the variables that have an impact on the quality of wine.
The expectation is that citric acid, pH, residual sugar, alcohol and total acidity will help explain the quality of wine. These factors contribute to the taste of the wine, which determines its quality, so they may have an impact on quality.
Yes, two new variables have been created. The first is total_acidity, the sum of volatile acidity and fixed acidity, as these two variables together determine the acidity of the wine. The second is rating, which categorises the wines by quality score into bad, average and good.
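The two derived variables can be sketched as follows. The data frame `wine_head` is a hypothetical name holding only the first ten rows shown in the str output, and the cut points for rating are an assumption chosen to match that output (quality 5 and 6 map to average, 7 maps to good).

```r
# First ten rows of the relevant columns, copied from the str() output.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# total_acidity is the sum of fixed and volatile acidity.
wine_head$total_acidity <- wine_head$fixed.acidity + wine_head$volatile.acidity

# rating bins the quality score into an ordered factor.
# Assumed bins: 0-4 bad, 5-6 average, 7-10 good.
wine_head$rating <- cut(
  wine_head$quality,
  breaks = c(0, 4, 6, 10),
  labels = c("bad", "average", "good"),
  include.lowest = TRUE,
  ordered_result = TRUE
)

str(wine_head)
```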
Alcohol, density, pH, fixed acidity, volatile acidity and citric acid are approximately normally distributed, as their skewness is below 1. Sulphates, total sulfur dioxide and free sulfur dioxide are moderately positively skewed. Chlorides and residual sugar are highly positively skewed, with extreme outliers. The citric acid variable has a large number of zero values.
The x-axis and y-axis limits have been set to give a closer view of the data. Plots with and without outliers have been drawn to understand the distribution of the data.
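One way to drop extreme values before plotting is to cut the variable at an upper quantile. This is only a sketch: the helper `trim_upper` and the 99th percentile cut-off are assumptions, not necessarily what was used for the plots in this report.

```r
# Keep only the values at or below the chosen upper quantile.
# trim_upper is an illustrative helper; prob = 0.99 is an assumed cut-off.
trim_upper <- function(x, prob = 0.99) {
  x <- x[!is.na(x)]
  cutoff <- quantile(x, probs = prob, names = FALSE)
  x[x <= cutoff]
}

# Illustrative vector (not wine data) with one extreme value.
values <- c(1:99, 1000)
trimmed <- trim_upper(values)

range(values)   # includes the extreme value
range(trimmed)  # extreme value removed
```

The trimmed vector can then be passed to a plotting function, or the same cut-off can be used as an axis limit so that the bulk of the distribution is easier to see.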
We will first plot a scatterplot matrix to understand the relationships between pairs of variables.
We will then compute the correlation coefficients of all the variables with quality.
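The correlation of each variable with quality can be computed with base R's cor function. The sketch below assumes Pearson correlation (the default) and uses a hypothetical helper `cor_with_target`. It runs on only the first ten rows from the str output so that it is self-contained, so its numbers are not the correlations for the full dataset.

```r
# First ten rows of a few columns, copied from the str() output.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Correlate every other column with the target column and sort the
# result by absolute strength, strongest first.
cor_with_target <- function(df, target) {
  predictors <- setdiff(names(df), target)
  cors <- sapply(df[predictors], function(col) cor(col, df[[target]]))
  cors[order(abs(cors), decreasing = TRUE)]
}

ranked <- cor_with_target(wine_head, "quality")
ranked
```

On the full data, the same call would be `cor_with_target(redWineData[sapply(redWineData, is.numeric)], "quality")`, which restricts the data frame to numeric columns first.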
In the plot above it can be seen that:
* Citric acid has a positive correlation with quality.
* Volatile acidity has a negative correlation with quality.
* Residual sugar, fixed acidity and chlorides have a weak relation with the quality variable.
* Citric acid has a strong relation with fixed and volatile acidity.

In the plot above it can be seen that:
* Alcohol has a positive correlation with quality, and the strongest relation with it.
* Total sulfur dioxide has a strong relation with free sulfur dioxide.
* Sulphates have a positive correlation with quality.
* pH, total acidity and sulfur dioxide have a weak relation with the quality variable.
Below we will plot different variables against quality on the y-axis to understand their effect on the quality of wine.
From the above plot it can be seen there exists a strong relation between alcohol and quality.
The volatile acidity has a negative relationship with quality.
The above plot shows that there is no strong relation between residual sugar and quality.
The above plot shows there is no strong relation between pH and quality.
The above plot shows there is a strong positive relation between sulphates and quality.
There exists a positive relation between citric acid and quality.
Now we will plot graphs for other variables to understand their relationships.
There exists a strong negative relationship between citric acid and volatile acidity.
There exists a strong positive relationship between fixed acidity and citric acid.
There exists a negative relationship between total sulphur dioxide and quality.
There exists no strong relationship between free sulphur dioxide and quality.
There exists a weak negative relationship between density and quality.
Now that a relation has been found between four variables (alcohol, volatile acidity, citric acid and sulphates) and quality, we will draw box plots showing the distribution of each of these variables within each rating category.
From the above plots it is observed that the alcohol content is high for good quality wines. The volatile acidity is lower for good quality wines. The citric acid content is somewhat higher in good quality wines than in bad and average ones. For most observations, sulphates lie between 0.5 and 0.8.
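The per-rating comparison behind the box plots can also be expressed as group means. The sketch below again uses only the first ten rows from the str output, under the hypothetical name `wine_head`, with assumed rating cut points (0-4 bad, 5-6 average, 7-10 good). No bad wines appear in those rows, so only two groups are returned.

```r
# First ten rows of the relevant columns, copied from the str() output.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumed rating bins, consistent with the str() output above.
wine_head$rating <- cut(
  wine_head$quality,
  breaks = c(0, 4, 6, 10),
  labels = c("bad", "average", "good"),
  include.lowest = TRUE,
  ordered_result = TRUE
)

# Mean of each variable within each rating category.
rating_means <- aggregate(
  cbind(alcohol, volatile.acidity) ~ rating,
  data = wine_head,
  FUN = mean
)
rating_means
```

A base R box plot of the same grouping would be `boxplot(alcohol ~ rating, data = wine_head)`.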
No variable displays a strong relationship with quality. Still, there is a relationship between quality and alcohol, sulphates, volatile acidity and citric acid. A very strong relation is also observed between citric acid and both fixed and volatile acidity. Better wines seem to have a higher concentration of citric acid and higher alcohol percentages. Residual sugar has no impact on quality.
It has been observed that there is a strong relation between pH and total acidity. Also there has been a strong relation observed between citric acid and fixed acidity and citric acid and volatile acidity.
Relative to quality, alcohol had the strongest relation. Among the other variables, citric acid and fixed acidity have a strong relation.
Now we will plot multiple variable plots to conclude on the factors that impact wine quality.
We have seen that alcohol has a strong relation with quality. We will therefore plot different variables against alcohol and quality to understand whether any of them, together with alcohol, has an impact on the quality of wine.
There is a strong negative relationship between alcohol and density.
High quality wines tend to have a low sulphate amount and a high alcohol amount.
Residual sugar has a weak relationship with alcohol.
From the above plot we can see a positive relationship between pH and alcohol.
Now we will plot graphs by fixing the acidity, this will help us to understand relationships of other variables apart from quality
It can be seen that citric acid and fixed acidity have a strong relationship.
It can be seen that residual sugar does not have a strong relation with fixed acidity.
It can be noted in the above plot that wines with low density and low fixed acidity reach a quality score of 8.
* Quality is high when volatile acidity and density are low.
* Quality gets higher with more alcohol and less sulphates.
* Wine has good quality when the amount of alcohol is high and volatile acidity is low.
* Density has the weakest correlation with quality.
* Residual sugar has no impact on quality.

From the above plots it can be noted that good quality wines are produced when the amounts of chlorides, sulphates, volatile acidity and citric acid are low and the alcohol amount is high.
It can be seen that high density is associated with bad quality wines.
Residual sugar has no impact on quality.
We will try to fit a linear model based on the data we have analysed so far:
plt1 <- lm(quality ~ alcohol, data = redWineData)
summary(plt1)
##
## Call:
## lm(formula = quality ~ alcohol, data = redWineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87497 0.17471 10.73 <2e-16 ***
## alcohol 0.36084 0.01668 21.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
##
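As a worked example of reading this output, the fitted line can be used to predict quality from alcohol. The sketch below hard-codes the two coefficients printed above; the function name `predict_quality` is hypothetical.

```r
# Coefficients copied from the summary(plt1) output above.
intercept <- 1.87497
slope_alcohol <- 0.36084

# Predicted quality for a given alcohol percentage under the
# single-predictor model quality ~ alcohol.
predict_quality <- function(alcohol) {
  intercept + slope_alcohol * alcohol
}

# Prediction at the mean alcohol content of the dataset (10.42).
predict_quality(10.42)

# Each additional percentage point of alcohol raises the predicted
# quality by the slope, about 0.36.
diff(predict_quality(c(10, 11)))
```

The prediction at the mean alcohol content comes out at roughly 5.63, close to the mean quality of 5.636, as expected for a least-squares line. With the fitted model object available, `predict(plt1, newdata = data.frame(alcohol = 10.42))` gives the same kind of prediction directly.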
## Calls:
## m1: lm(formula = (quality ~ alcohol), data = redWineData)
## m2: lm(formula = quality ~ alcohol + citric.acid, data = redWineData)
## m3: lm(formula = quality ~ alcohol + citric.acid + chlorides, data = redWineData)
## m4: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar,
## data = redWineData)
## m5: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar +
## total_acidity, data = redWineData)
## m6: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar +
## total_acidity + sulphates, data = redWineData)
##
## ======================================================================================================
## m1 m2 m3 m4 m5 m6
## ------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 1.830*** 2.056*** 2.085*** 2.000*** 1.719***
## (0.175) (0.171) (0.186) (0.187) (0.233) (0.228)
## alcohol 0.361*** 0.346*** 0.333*** 0.334*** 0.336*** 0.311***
## (0.017) (0.016) (0.017) (0.017) (0.017) (0.017)
## citric.acid 0.730*** 0.798*** 0.814*** 0.767*** 0.549***
## (0.090) (0.092) (0.093) (0.121) (0.120)
## chlorides -1.218** -1.200** -1.179** -2.564***
## (0.389) (0.390) (0.391) (0.408)
## residual.sugar -0.017 -0.017 -0.010
## (0.012) (0.012) (0.012)
## total_acidity 0.008 0.009
## (0.013) (0.013)
## sulphates 1.068***
## (0.113)
## ------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.257 0.262 0.263 0.263 0.302
## adj. R-squared 0.226 0.256 0.261 0.261 0.261 0.300
## sigma 0.710 0.696 0.694 0.694 0.694 0.676
## F 468.267 276.595 188.675 142.024 113.650 114.880
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1688.711 -1683.819 -1682.921 -1682.733 -1639.019
## Deviance 805.870 773.917 769.196 768.333 768.153 727.280
## AIC 3448.114 3385.421 3377.637 3377.842 3379.467 3294.037
## BIC 3464.245 3406.930 3404.523 3410.105 3417.106 3337.054
## N 1599 1599 1599 1599 1599 1599
## ======================================================================================================
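The table above reports the estimated coefficients for each of the six nested models, with standard errors in parentheses. Alcohol remains significant in every model, while residual.sugar and total_acidity are not significant. In m6, sulphates enters with a positive coefficient (1.068). R-squared rises from 0.227 for m1 to 0.302 for m6, so even the largest model explains only about 30% of the variance in quality.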
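The fit statistics in the table can be compared directly. The sketch below copies the adjusted R-squared, AIC and BIC rows into a hypothetical data frame called `fit_stats` and picks the model with the lowest information criteria.

```r
# Fit statistics copied from the model comparison table above.
fit_stats <- data.frame(
  model  = paste0("m", 1:6),
  adj_r2 = c(0.226, 0.256, 0.261, 0.261, 0.261, 0.300),
  aic    = c(3448.114, 3385.421, 3377.637, 3377.842, 3379.467, 3294.037),
  bic    = c(3464.245, 3406.930, 3404.523, 3410.105, 3417.106, 3337.054),
  stringsAsFactors = FALSE
)

# Lower AIC and BIC indicate a better trade-off between fit and complexity.
best_by_aic <- fit_stats$model[which.min(fit_stats$aic)]
best_by_bic <- fit_stats$model[which.min(fit_stats$bic)]

# Change in AIC when each extra term is added (NA for the first model).
fit_stats$delta_aic <- c(NA, diff(fit_stats$aic))

best_by_aic
best_by_bic
fit_stats
```

Both criteria favour m6. The positive AIC changes for m4 and m5 are consistent with residual.sugar and total_acidity adding little to the model.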
A few of the relationships identified in the bivariate analysis, when combined, had an impact on the quality of wine. Here are the observations:
* Quality is high when volatile acidity and density are low.
* Quality gets higher with more alcohol and less sulphates.
* Wine has good quality when the amount of alcohol is high and volatile acidity is low.

A few variables also showed no impact on quality when added alongside alcohol:
* Density has the weakest correlation with quality.
* Residual sugar has no impact on quality.
Earlier it was assumed that pH and citric acid would have a great impact on the quality of the wine, but these variables did not have a significant impact. Variables like volatile acidity and sulphates, when present in smaller amounts, were associated with good quality wines.
It can be noted that the dataset consists mostly of average quality wines: 1319 of the 1599 wines are rated average, while only 63 are rated bad and 217 good. This constraint makes it difficult to determine the factors that impact the quality of wine.
The above plot shows that alcohol, sulphates, volatile acidity and citric acid are the variables most strongly related to quality. The mean alcohol percentage for good quality wines is approximately 11%. Volatile acidity is lower in good quality wines (mean of 0.4) than in bad and average quality wines. The mean citric acid (0.375) is higher in good quality wines than in bad and average quality wines. Sulphates are present in very small amounts in all three categories of wine. It can be seen that the amounts of these four variables can have an impact on the quality of the wine.
It is observed that good quality wine is obtained when the volatile acidity and sulphates amounts are low and the alcohol content is high. Density and pH have no impact on the quality of wine.
Before beginning the bivariate and multivariate analysis, it was thought that pH and density would play a major role in the quality of wine. Only alcohol was seen to contribute to quality both before and after the analysis.
After the analysis it was found that a high amount of alcohol and low amounts of sulphates and volatile acidity can produce good quality wines.
For future work, if a dataset with more good and bad rated wines is procured, the variables impacting quality can be better determined.